This week's work has been mainly related to data wrangling and running small regression analyses.
lrn14 <- read.table("~/Documents/GitHub/IODS-project/data/learning2014.txt", header = T, sep = "\t")
dim(lrn14)
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
So the data has 166 rows/individuals and 7 variables. Gender is of type factor, Age and Points are integers, and the rest of the variables are numeric.
Understanding the relationship between various variables
library(GGally) # Loading library for plotting
library(ggplot2) # Loading library for plotting
p <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
This plot shows the distributions of the variables, coloured by gender, together with the pairwise correlations between them. Surface questions (surf) seem to be negatively correlated with all of the other variables, and the highest correlation is observed between Points and attitude. The density plots also suggest overall differences between females and males.
Fitting a linear regression model
my_model <- lm(Points ~ attitude + stra + surf, data = lrn14)
summary(my_model)
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
final_model <- lm(Points ~ attitude, data = lrn14)
summary(final_model)
##
## Call:
## lm(formula = Points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
A linear regression model where Points is the dependent variable and attitude, strategic questions and surface questions are the independent variables. Based on the summary, only attitude is significantly associated with Points. Surface questions have a negative association with Points, as can be seen from the sign of their coefficient. Based on this, we create a final model which has only attitude in it. In the final model, for every one-unit increase in attitude, Points increase by about 3.53 on average. Multiple R-squared signifies the proportion of the variance in Points explained by the model.
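The coefficient interpretation above can be checked directly: in a fitted linear model, the slope equals the difference in predicted outcome for a one-unit change in the predictor. A minimal sketch using simulated data (not the course data):

```r
# Simulated toy data (assumption: illustrative only, not lrn14)
set.seed(1)
toy <- data.frame(attitude = runif(100, 1, 5))
toy$Points <- 10 + 3.5 * toy$attitude + rnorm(100, sd = 2)

fit <- lm(Points ~ attitude, data = toy)
slope <- coef(fit)["attitude"]

# predicted Points at attitude = 3 and attitude = 4
pred <- predict(fit, newdata = data.frame(attitude = c(3, 4)))
diff(pred)  # equals the slope: a one-unit increase in attitude
```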
Performing model validation:
par(mfrow = c(2,2))
plot(my_model, which = 1)
plot(my_model, which = 2)
plot(my_model, which = 5)
This week's work has been mainly related to performing logistic regression in R.
Reading in the file & loading the required libraries
alc <- read.table("~/Documents/GitHub/IODS-project/data/alcohol_consumption.txt", header = T, sep = "\t")
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Learning more about the data
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc)
## [1] 382 35
As can be seen above, this file has information about 382 individuals, with 35 variables for each student. The data is about alcohol consumption and school performance in two Portuguese schools, and it also includes demographic information, student grades, and social and school-related features. We will perform a logistic regression comparing students who drink high amounts of alcohol against students who drink low amounts.
Take a guess
Exploring the data!
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
alc %>% group_by(high_use, sex) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: high_use [?]
## high_use sex count mean_grade
## <lgl> <fctr> <int> <dbl>
## 1 FALSE F 156 11.39744
## 2 FALSE M 112 12.20536
## 3 TRUE F 42 11.71429
## 4 TRUE M 72 10.27778
g1 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + ylab("Absences") + ggtitle("Absences")
g2 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("Grades") + ggtitle("Grades")
g3 <- ggplot(alc, aes(x = high_use, y = freetime, col = sex)) + geom_boxplot() + ylab("Freetime") + ggtitle("Freetime")
grid.arrange(g3, g1, g2, ncol = 2)
Based on the above plots, the differences between the high- and low-use groups in these variables appear to be larger in males than in females.
Performing logistic regression
mod1 <- glm(high_use ~ freetime + absences + G3 + sex, data = alc, family = "binomial")
summary(mod1)
##
## Call:
## glm(formula = high_use ~ freetime + absences + G3 + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0212 -0.8299 -0.6063 1.0926 2.1162
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.87119 0.61712 -3.032 0.002428 **
## freetime 0.28229 0.12544 2.250 0.024423 *
## absences 0.09394 0.02293 4.097 4.18e-05 ***
## G3 -0.07348 0.03610 -2.036 0.041778 *
## sexM 0.88194 0.24645 3.579 0.000346 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 420.33 on 377 degrees of freedom
## AIC: 430.33
##
## Number of Fisher Scoring iterations: 4
OR <- exp(coef(mod1))
CI <- exp(confint(mod1))
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1539399 0.04471733 0.5058444
## freetime 1.3261680 1.03978990 1.7022310
## absences 1.0984987 1.05237029 1.1515837
## G3 0.9291511 0.86511705 0.9970999
## sexM 2.4155823 1.49749137 3.9428978
As can be seen from above, all of the chosen variables seem to be associated with high alcohol usage. The strongest association is with sex: the odds ratio suggests that a male student is 2.42 times more likely than a female student to be a high consumer of alcohol. Freetime and absences also have odds ratios above 1, indicating a positive association with high alcohol consumption, while grades, as expected, are negatively associated with it.
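The relationship between a logistic regression coefficient and its odds ratio can be verified on simulated data (a sketch, not the course data): exp(beta) is the multiplicative change in the odds for a one-unit increase in the predictor.

```r
# Simulated toy data (assumption: illustrative only, not alc)
set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 0.9 * x))
fit <- glm(y ~ x, family = "binomial")

OR <- exp(coef(fit)["x"])  # odds ratio per unit of x

# check: odds at x = 1 divided by odds at x = 0 equals exp(beta)
odds <- function(x0) {
  p <- predict(fit, newdata = data.frame(x = x0), type = "response")
  p / (1 - p)
}
odds(1) / odds(0)  # same as OR
```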
Testing the predictive power of the model based on these 4 variables
probabilities <- predict(mod1, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.66230366 0.03926702
## TRUE 0.22513089 0.07329843
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
As can be seen from the table and the plot, this model isn’t optimal. The total error rate is ~26%, and there are considerably more false negatives than false positives.
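One way to compute the error rate quoted above is a small loss function that counts how often the predicted probability falls on the wrong side of 0.5; a sketch with made-up values (not the course data):

```r
# Mean prediction error: share of observations where the predicted
# probability is more than 0.5 away from the true class (0/1).
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# toy example (assumption: illustrative values, not the alc data)
truth <- c(TRUE, FALSE, FALSE, TRUE)
probs <- c(0.9,  0.2,   0.7,   0.4)
loss_func(class = truth, prob = probs)  # 0.5: two of four misclassified
```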
This week's work is related to classifying and clustering data.
Loading the dataset and the required libraries
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.2
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 1.3.4 ✔ purrr 0.2.4
## ✔ readr 1.1.1 ✔ stringr 1.2.0
## ✔ tibble 1.3.4 ✔ forcats 0.2.0
## Warning: package 'purrr' was built under R version 3.4.2
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
library(dplyr)
library(ggplot2)
data("Boston")
Exploring the structure and dimension of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset contains information about housing in Boston, Mass. It’s a relatively small dataset with only 506 observations and contains 14 variables. Some of these variables include crime rates, proportion of residential land zoned, nitric oxide concentration, average number of rooms per dwelling and so on.
Graphical overview of the dataset
#pairs(Boston)
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
cor_matrix<-cor(Boston)
corrplot(cor_matrix %>% round(2), method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
As can be seen above, the variables follow different kinds of distributions. We have also calculated the correlations between the variables. The highest correlation is between “rad” and “tax”, where rad stands for accessibility to radial highways and tax for the property tax rate. There is also a strong negative correlation between “age” and “dis”, i.e. between the proportion of owner-occupied units built prior to 1940 and the weighted mean of the distances to five Boston employment centres.
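Picking out the most strongly correlated pair from a correlation matrix, as done by eye above, can also be scripted; a sketch on simulated data (with the Boston set this would pick out rad and tax):

```r
# Toy data (assumption: illustrative only): b is built from a,
# so the a-b pair should have the largest absolute correlation.
set.seed(5)
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.2)
toy$c <- rnorm(100)

cm <- cor(toy)
diag(cm) <- 0  # ignore self-correlations
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]
sort(rownames(cm)[idx])  # "a" "b"
```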
Standardizing the dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
Scaling standardizes the continuous variables: from each value we subtract the mean of that variable and divide by its standard deviation. Furthermore, we have created a categorical variable for the crime rate using the quantiles as break points, and divided the dataset into training (80%) and test (20%) sets.
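The standardization described above is exactly what scale() does column by column; a minimal sketch with a toy vector:

```r
# scale() standardizes to mean 0 and sd 1, i.e. (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8, 10)
by_hand  <- (x - mean(x)) / sd(x)
by_scale <- as.vector(scale(x))
all.equal(by_hand, by_scale)  # TRUE

# quantile() gives the break points used for the crime categories
quantile(by_hand)
```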
Creating Linear Discriminant Analysis model
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2500000 0.2450495 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 0.94157733 -0.8854392 -0.0793339581 -0.8595014 0.4207280
## med_low -0.05953014 -0.2754450 0.0005392655 -0.5603640 -0.1678177
## med_high -0.39975350 0.1661412 0.1651265141 0.4104981 0.1064878
## high -0.48724019 1.0171096 -0.0407349362 1.0657879 -0.4146501
## age dis rad tax ptratio
## low -0.8725460 0.8952815 -0.7015101 -0.7252251 -0.41736515
## med_low -0.2962204 0.3810652 -0.5440892 -0.4542263 -0.07138447
## med_high 0.3974486 -0.3905543 -0.4227186 -0.3256471 -0.32565648
## high 0.7950240 -0.8461146 1.6382099 1.5141140 0.78087177
## black lstat medv
## low 0.37813681 -0.773093789 0.47804237
## med_low 0.34637117 -0.091645952 -0.03187986
## med_high 0.08284963 -0.002593142 0.16798456
## high -0.73281093 0.905063375 -0.76158483
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.115848123 0.69800600 -0.93669491
## indus -0.003834602 -0.12495330 0.39206741
## chas -0.100687681 -0.01246109 0.10257318
## nox 0.363121739 -0.79262489 -1.37812732
## rm -0.106355361 -0.01552473 -0.18162807
## age 0.225537306 -0.29299748 0.02186304
## dis -0.094930837 -0.19880548 0.32858141
## rad 3.377694077 0.85736482 -0.10316839
## tax 0.078274925 0.02066276 0.56895311
## ptratio 0.140676465 0.02712051 -0.34689809
## black -0.127069432 0.07960893 0.18405788
## lstat 0.263055910 -0.26153421 0.45909712
## medv 0.240816226 -0.55423541 -0.13247206
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9552 0.0327 0.0120
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
Predicting on the test data
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 7 2 0
## med_low 6 15 4 0
## med_high 1 8 16 2
## high 0 0 0 25
Using the model fitted on the training data, we predict the classes for the test data and assess the accuracy of the model by comparing the predictions to the true classes. The model performs really well for the "high" class (the values lying in the 4th quartile): all 25 observations are predicted correctly, with only 2 false positives. For the other classes the predictions are less accurate, especially for "med_low" and "med_high".
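The overall accuracy can be read off the cross-tabulation printed above: correct predictions sit on the diagonal. A sketch using those table values:

```r
# Confusion matrix copied from the cross-tabulation above
conf <- matrix(c(16,  7,  2,  0,
                  6, 15,  4,  0,
                  1,  8, 16,  2,
                  0,  0,  0, 25),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(conf)) / sum(conf)  # correct / all predictions
round(accuracy, 3)  # about 0.71
```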
Clustering the data
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <-kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
So we tried clustering the Boston dataset, starting without knowing the ideal number of clusters. We first calculate the Euclidean distances between the observations and then fit k-means models with 1-10 clusters. As can be seen from the plot, the total within-cluster sum of squares drops most sharply between 1 and 2 clusters, so the ideal number of clusters for this dataset seems to be 2. Based on the last plot, some variables and their interactions separate the two clusters very well (for example crim), whereas others, such as lstat, do not.
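The "biggest drop" criterion used above can also be located programmatically; a sketch on simulated data with two clearly separated clusters (not the Boston set):

```r
# Toy data (assumption: illustrative only): two well-separated blobs
set.seed(3)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(toy, k, nstart = 10)$tot.withinss)

# the largest one-step drop in total within-cluster sum of squares
# suggests the number of clusters; here it is the drop from 1 to 2
which.max(-diff(twcss)) + 1
```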
This week we will explore dimensionality reduction techniques, including PCA and MCA.
Loading the data and the required libraries for the analysis
library(corrplot)
library(dplyr)
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.2
library(GGally)
library(ggplot2)
library(tidyr)
human <- read.table("~/Documents/GitHub/IODS-project/data/human.txt", header = T, sep = "\t", row.names = 1)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
Graphical overview of the data
ggpairs(human)
cor(human) %>% corrplot(title = "Correlation plot", type = "upper", tl.cex = 0.6)
Performing Principal Component Analysis
pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.8,1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Performing Principal Component Analysis (Standardized data)
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components%s:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
biplot(pca_human_std, xlab = "PC1 (53.6%)", ylab = "PC2 (16.2%)",choices = 1:2, cex = c(0.8,1), col = c("grey40", "deeppink2"))
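The percentages placed on the biplot axes above come from the proportion-of-variance row of the summary; they can be computed from the component standard deviations. A sketch on simulated data (not the human data):

```r
# Toy data (assumption: illustrative only)
set.seed(4)
toy <- matrix(rnorm(200), ncol = 4)

pca <- prcomp(scale(toy))
# squared sdevs are the component variances (eigenvalues);
# dividing by their sum gives the proportion of variance explained
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * prop_var, 1)  # one percentage per component, as on the axes
sum(prop_var)             # 1: the components account for all the variance
```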
Multiple Correspondence Analysis
library(FactoMineR)
library(dplyr)
library(ggplot2)
data(tea)
#keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
#tea_time <- select(tea, keep_columns)
tea_time <- data.frame(tea$Tea, tea$How, tea$how, tea$sugar, tea$where, tea$lunch)
summary(tea_time)
## tea.Tea tea.How tea.how tea.sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## tea.where tea.lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ tea.Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ tea.How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ tea.how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ tea.sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ tea.where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ tea.lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300 6
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## tea.Tea | 0.126 0.108 0.410 |
## tea.How | 0.076 0.190 0.394 |
## tea.how | 0.708 0.522 0.010 |
## tea.sugar | 0.065 0.001 0.336 |
## tea.where | 0.702 0.681 0.055 |
## tea.lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("var"))
cats = apply(tea_time, 2, function(x) nlevels(as.factor(x)))
mca_obs_df = data.frame(mca$ind$coord)
mca_vars_df = data.frame(mca$var$coord, Variable = rep(names(cats), cats))
ggplot(data=mca_vars_df, aes(x = Dim.1, y = Dim.2, label = rownames(mca_vars_df))) + geom_hline(yintercept = 0, colour = "black") + geom_vline(xintercept = 0, colour = "black") + geom_text(aes(colour=Variable)) + ggtitle("MCA plot")